Question 1

  1. Start with a basic exploratory data analysis. Show summary statistics of the response variable and predictor variable.
df = read.csv("E:\\Linder_college\\Linear Regression\\dataset\\alumni.csv")

Summary Statistics for Percent of classes under 20:

summary(df$percent_of_classes_under_20)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   44.75   59.50   55.73   66.25   77.00

Summary Statistics for Alumni giving rate:

summary(df$alumni_giving_rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   18.75   29.00   29.27   38.50   67.00
  2. What is the nature of the variables X and Y? Are there outliers? What is the correlation coefficient? Draw a scatter plot. Any major comments about the data?
plot(df$percent_of_classes_under_20, df$alumni_giving_rate, pch=20, xlab = "Percent of Classes Under 20",ylab = "Alumni Giving Rate", main = "Percent of Classes Under 20 VS Alumni Giving Rate")

cat("Correlation coefficient is:\n")
## Correlation coefficient is:
cor(df$percent_of_classes_under_20,df$alumni_giving_rate)
## [1] 0.6456504
hist(df$percent_of_classes_under_20, xlab = "Percent of Classes Under 20",main = "Histogram of Percent of Classes Under 20 ")

hist(df$alumni_giving_rate, xlab = "Alumni giving Rate", main = "Histogram of Alumni Giving Rate")

Both the predictor and the response variable appear to be continuous.

boxplot(df$percent_of_classes_under_20,main = "Box Plot of Percent of Classes Under 20 ", ylab = "Percent of Classes Under 20")

boxplot(df$alumni_giving_rate, main = "Box Plot of Alumni Giving Rate", ylab = "Alumni Giving Rate")

From the box plots above we can infer that there are no outliers in either the predictor or the response variable.

The scatter plot shows an upward trend, so there is a positive correlation between the two variables.

The points are fairly scattered around that trend, consistent with the moderate correlation coefficient of 0.646. Visually, the relationship appears somewhat tighter for schools with 60 to 77 percent of classes under 20 than for those in the 30 to 60 percent range.
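As a numeric cross-check of the box plots, the 1.5 × IQR rule (the same rule boxplot() uses to flag points) can be applied directly; this sketch assumes df is loaded as above.

```r
# Values flagged as outliers by the 1.5 * IQR rule; an empty result
# (numeric(0)) confirms what the box plots show.
boxplot.stats(df$percent_of_classes_under_20)$out
boxplot.stats(df$alumni_giving_rate)$out
```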

  3. Fit a simple linear regression to the data. What is your estimated regression equation?
plot(df$percent_of_classes_under_20, df$alumni_giving_rate, pch=20, xlab = "Percent of Classes Under 20",ylab = "Alumni Giving Rate", main = " Linear Regression Plot: Percent of Classes Under 20 VS Alumni Giving Rate")
abline(lm(df$alumni_giving_rate ~ df$percent_of_classes_under_20),lwd=1.5)

model1 = lm(formula = df$alumni_giving_rate ~ df$percent_of_classes_under_20)

model1
## 
## Call:
## lm(formula = df$alumni_giving_rate ~ df$percent_of_classes_under_20)
## 
## Coefficients:
##                    (Intercept)  df$percent_of_classes_under_20  
##                        -7.3861                          0.6578

The estimated regression equation is Y = -7.3861 + 0.6578X, where X is the percent of classes under 20 and Y is the alumni giving rate.
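As an illustration of using the fitted equation (plugging the reported coefficients into the line by hand, not re-fitting the model), a school with 60% of classes under 20 would be predicted to have an alumni giving rate of about 32%:

```r
# Hand-computed prediction from the reported coefficients
b0 <- -7.3861
b1 <- 0.6578
b0 + b1 * 60  # -7.3861 + 0.6578 * 60 = 32.0819
```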

  4. Interpret your results.
summary(model1)
## 
## Call:
## lm(formula = df$alumni_giving_rate ~ df$percent_of_classes_under_20)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.053  -7.158  -1.660   6.734  29.658 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -7.3861     6.5655  -1.125    0.266    
## df$percent_of_classes_under_20   0.6578     0.1147   5.734 7.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared:  0.4169, Adjusted R-squared:  0.4042 
## F-statistic: 32.88 on 1 and 46 DF,  p-value: 7.228e-07

For each one-percentage-point increase in the percent of classes under 20, the alumni giving rate increases by about 0.6578 percentage points on average. The p-value for the slope (7.23e-07) is well below 0.05, so percent_of_classes_under_20 is statistically significant for predicting the alumni giving rate.

The residual standard error of 10.38 means a typical prediction misses the observed alumni giving rate by roughly 10 percentage points; given that the response ranges from 7 to 67, the fit is moderate rather than tight, so predictions from this model carry substantial uncertainty.

The multiple R-squared value of 0.4169 means that approximately 41.7% of the variance in the response variable, alumni_giving_rate, is accounted for by the predictor variable percent_of_classes_under_20.

Additionally, the F-statistic of 32.88, with a p-value below 0.05, indicates that the model as a whole is statistically significant in explaining the variation in alumni_giving_rate (with a single predictor, this is equivalent to the slope's t-test).

The adjusted R-squared of 0.40, which penalizes R-squared for the number of predictors, tells a similar story: about 40% of the variability in alumni_giving_rate is explained by the model.
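A 95% confidence interval for the slope complements these p-values. Computing it from the reported estimate and standard error (confint(model1) would give the same result from the fitted object) yields roughly (0.43, 0.89); since the interval excludes zero, it agrees with the significant t-test:

```r
# Approximate 95% CI for the slope from the summary output above
est <- 0.6578                 # slope estimate
se  <- 0.1147                 # its standard error
tcrit <- qt(0.975, df = 46)   # t critical value at 46 residual df
est + c(-1, 1) * tcrit * se   # roughly (0.427, 0.889)
```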

Question 2

  1. Generating the Data
set.seed(7052)

x = rnorm(n = 100, mean = 2, sd = 0.1)

error_data = rnorm(n = 100, mean = 0, sd = 0.5) 

y = 10 + 5*x + error_data
  2. Show summary statistics of the response variable and predictor variable. Are there outliers? What is the correlation coefficient? Draw a scatter plot.
cat("Summary Statistics for X\n")
## Summary Statistics for X
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.725   1.923   2.001   2.004   2.070   2.243
cat("\nSummary Statistics for Y \n")
## 
## Summary Statistics for Y
summary(y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.09   19.67   20.11   20.17   20.70   21.80
hist(x, xlab = "Predictor Variable X", main = "Histogram of Predictor Variable X")

hist(y, ylab = "Response Variable Y", main = "Histogram of Response Variable Y")

cat("Correlation coefficient for data \n")
## Correlation coefficient for data
cor(x,y)
## [1] 0.8042198

Scatter Plot

plot(x,y, pch=20, xlab = "Predictor Variable X", ylab = "Response Variable Y", main = "Predictor Variable X VS Response Variable Y" )

boxplot(x,xlab = "Predictor Variable X", main = "Box Plot of Predictor Variable X" )

boxplot(y,  ylab = "Response Variable Y", main = "Box Plot of Response Variable Y")

With a correlation coefficient of about 0.804, there is a strong positive linear relationship between X and Y. The box plots also show no outliers in either variable.
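Since the data were generated from a known model, the sample correlation can be compared with its theoretical value: for Y = 10 + 5X + ε with sd(X) = 0.1 and sd(ε) = 0.5, the population correlation is 5 · 0.1 / sqrt(5² · 0.1² + 0.5²) = 0.5 / sqrt(0.5) ≈ 0.707. The observed 0.804 sits somewhat above this, reflecting sampling variability at n = 100.

```r
# Theoretical correlation implied by the generating model
beta <- 5; sd_x <- 0.1; sd_e <- 0.5
beta * sd_x / sqrt(beta^2 * sd_x^2 + sd_e^2)  # = 0.5 / sqrt(0.5) ≈ 0.7071
```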

  3. Fit a simple linear regression. What is the estimated model? Report the estimated coefficients. What is the model mean squared error (MSE)?
model = lm(formula = y ~ x)

model_summ = summary(model)

model_summ
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2073 -0.3029  0.0093  0.3033  1.3545 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.0218     0.8336   10.82   <2e-16 ***
## x             5.5652     0.4155   13.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared:  0.6468, Adjusted R-squared:  0.6432 
## F-statistic: 179.4 on 1 and 98 DF,  p-value: < 2.2e-16

The estimated regression equation is Y = 9.0218 + 5.5652X.

The estimated coefficients are: intercept = 9.0218, slope = 5.5652. Both are reasonably close to the true generating values of 10 and 5.

mean(model_summ$residuals^2)
## [1] 0.1992276

The model mean squared error (MSE) is 0.1992
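As a consistency check, this MSE (which divides the residual sum of squares by n = 100) can be recovered from the residual standard error reported by summary(), which divides by n − 2 = 98: 0.4509² · 98/100 ≈ 0.1992, matching the value above.

```r
# MSE (divide by n) recovered from the residual standard error (divide by n - 2)
sigma_hat <- 0.4509            # residual standard error from the summary above
n <- 100
sigma_hat^2 * (n - 2) / n      # approximately 0.1992
```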

  4. What is the sample mean of both X and Y? Plot the fitted regression line and the point (X¯,Y¯). What do you find?
x_mean = mean(x)
y_mean = mean(y)

cat("Sample mean of X", x_mean)
## Sample mean of X 2.003677
cat("\n")
cat("Sample mean of Y", y_mean)
## Sample mean of Y 20.17258
plot(x, y, pch=20, xlab = "Predictor Variable X", ylab = "Response Variable Y", main = "Linear Regression Plot: X VS Y")

abline(lm(y~x),lwd=1.5)

points(x_mean, y_mean,col = 'red',pch=20)

Here we can observe that the point (X¯,Y¯) lies on the regression line. This is not evidence of a good fit, however, but a property of least squares: because the intercept is estimated as b0 = Y¯ − b1·X¯, the fitted line always passes through the point of means, regardless of how well the model fits the data.
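This property is easy to verify on any data set, since b0 = Y¯ − b1·X¯ by construction. The demo below uses arbitrary simulated data (the names x_demo and y_demo are just for illustration):

```r
# The OLS fitted line passes through (mean(x), mean(y)) for any data
set.seed(1)
x_demo <- runif(20)
y_demo <- rnorm(20)                        # no real relationship needed
fit <- lm(y_demo ~ x_demo)
yhat_at_mean <- coef(fit)[[1]] + coef(fit)[[2]] * mean(x_demo)
abs(yhat_at_mean - mean(y_demo)) < 1e-10   # TRUE
```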

Question 3

Question 4